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Abstract: We propose a generalized double Pareto prior for Bayesian shrinkage estimation 
and inferences in linear models. The prior can be obtained via a scale mixture of Laplace 
or normal distributions, forming a bridge between the Laplace and Normal- Jeffreys' priors. 
While it has a spike at zero like the Laplace density, it also has a Student's t-like tail be- 
havior. Bayesian computation is sttaightforward via a simple Gibbs sampling algorithm. We 
investigate the properties of the maximum a posteriori estimator, as sparse estimation plays 
an important role in many problems, reveal connections with some well-established regular- 
ization procedures, and show some asymptotic results. The performance of the prior is tested 
through simulations and an application. 

Key words and phrases: Heavy tails, high-dimensional data, LASSO, maximum a posteriori 
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1. Introduction 

There has been a great deal of work in shrinkage estimation and simultaneous 
variable selection in the frequentist framework. The LASSO of Tibshirani (1996) has 
drawn much attention to the area, particularly after the introduction of LARS (Efron et 
al. (2004)) due to its superb computational performance. There is a rich literature an- 
alyzing the LASSO and related approaches (Fu (1998), Knight and Fu (2000), Fan and 
Li (2001), Yuan and Lin (2005), Zhao and Yu (2006), Zou (2006), Zou and Li (2008)), 
with a number of articles considering asymptotic properties. 

Bayesian approaches to the same problem became popular with the works of Tip- 
ping (2001) and Figueiredo (2003). By expressing Student's t priors for basis coef- 
ficients as scale mixtures of normals (West (1987)), and relying on type II maximum 
likelihood estimation (Berger (1985)), Tipping (2001) developed the relevance vector 
machine for sparse estimation in kernel regression. In this setting, however, exact spar- 
sity comes with the price of forfeiting propriety of the posterior by driving the scale 
parameter of the Student's t distribution toward zero. In fact, driving both the scale 
parameter and the degrees of freedom to zero yields the so-called Normal-Jeffreys' 
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prior, Tr{6) oc l/\9\. The name emerges due to the fact that the hierarchy follows as 
9 ~ N(0, r), 7r(r) oc 1/r, where the latter is the Jeffreys' prior on the prior variance 
of 6. Figueiredo (2003) proposed an expectation-maximization algorithm for maxi- 
mum a posteriori estimation under Laplace and Normal- Jeffreys' priors, with estimates 
under the Laplace corresponding to the LASSO. The Normal- Jeffreys' prior leads to 
substantially improved performance with finite samples due to the property of strongly 
shrinking small coefficients to zero while minimally shrinking large coefficients due to 
the heavy tails; however, it has no meaning from an inferential aspect as it leads to an 
improper posterior. 

A Bayesian LASSO was proposed by Park and Casella (2008) and Hans (2009). 
However, these procedures inherit the problem of over-shrinking large coefficients due 
to the relatively light tails of the Laplace prior. Strawderman-Berger priors (Strawder- 
man (1971), Berger (1980)) have some desirable properties yet lack a simple analytic 
form. Recently proposed priors have been designed to have high density near zero and 
heavy tails without the impropriety problem of Normal-Jeffreys. The horseshoe prior of 
Carvalho, Poison, and Scott (2009, 2010) is induced through a carefully-specified mix- 
ture of normals, leading to such desirable properties as an infinite spike at zero and very 
heavy tails. They studied sparse shrinkage estimation properties of the horseshoe in a 
normal means problem. Griffin and Brown (2007, 2010) proposed an alternative class 
of hierarchical priors for shrinkage with some similarities to the prior we propose, but it 
lacks a simple analytic form that facilitates the study of some properties. 

There is a need for alternative shrinkage priors that lead to sparse point estimates 
if desired, do not over-shrink coefficients that are not close to zero, facilitate straight- 
forward computation even in large p cases, and result in a joint posterior distribution 
that does a good job of quantifying uncertainty. We propose the generalized double 
Pareto prior which independently finds mention in Cevher (2009). It has a simple ana- 
lytic form, yields a proper posterior, and possesses such appealing properties as a spike 
at zero. Student's t-like tails, and a simple characterization as a scale mixture of nor- 
mals that leads to a straightforward Gibbs sampler for posterior inferences. We consider 
both fully Bayesian and frequentist penaUzed likeUhood approaches based on this prior. 
We show that the induced penalty in the regularization framework yields a consistent 
thresholding rule having the continuity property in the orthogonal case, with a simple 
expectation-maximization algorithm described for sparse estimation in non-orthogonal 
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cases. In another independent work motivated by applications to genome wide associ- 
ations studies, Lee et al. (2011) consider a generalized t prior (McDonald and Newey 
(1988)) that includes the generalized double Pareto as a special case. Similarities to 
previous work are limited and our contributions beyond them are (i) the formal intro- 
duction of a generalized Pareto density, thresholded and folded at zero, as a shrinkage 
prior in Bayesian analysis, (ii) the scale mixture representation of the generalized double 
Pareto in Proposition 1 which is central to our work, (iii) its connection to the Laplace 
and Normal-Jeffreys' priors as limiting cases in Proposition 2, (iv) the resulting fully 
conditional posteriors in a linear regression setting along with a simple Gibbs sampling 
procedure, (v) a detailed discussion on the hyper-parameters a and rj and their treatment, 
along with the incorporation of a griddy sampling scheme into the Gibbs sampler, (vi) 
a detailed analysis of the induced penalty by the generalized double Pareto prior and 
the properties of the resulting thresholding rule, (vii) an explicit analytic form for the 
maximum a posteriori estimator in orthogonal cases, (viii) an expectation-maximization 
procedure to obtain the maximum a posteriori estimate in non-orthogonal cases using the 
normal mixture representation, (ix) the one-step estimator (Zou and Li (2008)) resulting 
from the Laplace mixture representation, revealing the connection of the resulting pro- 
cedure to the adaptive LASSO of Zou (2006), and (x) the oracle properties of the resulting 
estimators. 

2. Generalized Double Pareto Prior 

The generalized double Pareto density is 
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-(a+l) 



/(*a) = ^ll + ;^| , (2.1, 



where ^ > is a scale parameter and a > is a shape parameter. In contrast to (2.1 1, 
the generalized Pareto density of Pickands (1975) is parametrized in terms of a location 
parameter /i G M, a scale parameter ^ > 0, and a shape parameter a G M as 

f{0\^,a,fi) = -i^l + ^j , (2.2) 

with > /i for a > and /i < ^ < ^ — for a < 0. The mean and variance 
for the generalized Pareto distribution are K{9) = + ^/(l — 1/a) for a ^ [0, 1] and 



Y{e) = ^^(1 - l/a)"^(l - for q ^ [0,2]. If we let = 0, (2.2i becomes an 
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exponential density as q — ?■ oo with mean ^ and variance 

To modify the generalized Pareto density to be a shrinkage prior, we let /x = and 
reflect the positive part about the origin, assuming a > 0, for a density that is symmetric 
about zero. The mean and variance for the generalized double Pareto distribution are 
E{e) = for a > 1 and Y{9) = 2C'^a'^{a - l)~^{a - 2)-^ for a > 2. The dispersion 
is controlled by ^ and a, with a controlling the tail heaviness and a = 1 corresponding 
to Cauchy-like tails and no finite moments. 

Figure 2. 1 compares the density in ( |2. 1| ) to Cauchy and Laplace densities for the 



special case ^ = a = 1, so that f{e) = 1/{2(1 + \9\f}. We refer to this form 
as the standard double Pareto. Near zero, the standard double Pareto resembles the 
Laplace density, suggesting similar sparse shrinkage properties of small coefficients in 
maximum a posteriori estimation. It also has Cauchy-like tails, which is appealing in 
avoiding over-shrinkage away from the origin. This is illustrated in Figure |2[TJ a). Figure 
|2.1fc ) illustrates how the density in (2. 1 1 changes for different values of ^ and a. 



Prior ( |2.1| ) can be represented as a scale mixture of normal distributions leading to 
computational simplifications. As shorthand notation, let 6 ~ GDP(^, q) denote that 6 
has density ( |2.1| ). 

Proposition 1. Let 6 ~ A'^(0, r), r ~ Exp{\^ /2), and A ~ Ga{a, 7]), where q > and 
r] > 0. The resulting marginal density for 9 is GDP{^ = rj/a, a). 



Proposition 1 reveals a relationship between the prior in ( 2. 1 1 and the prior of Griffin 
and Brown (2007), with the difference being that Griffin and Brown (2007) place a 
mixing distribution on leading to a mai^ginal density on 9 with no simple analytic 
form. 



In Proposition 2 we show that the prior in ( 2. 1 1 forms a bridge between two limiting 
cases - Laplace and Normal- Jeffreys' priors. 

Proposition 2. Given the representation in Proposition 1,9^ GDP{^ = r]ja^a) implies 

1. f{9) ocl/\9\fora = 0andr] = 0, 

2. f{9\X') = (A72) exp {-X'\9\)for a ^ oo, a/r? = A' and < A' < oo. 

Proof. For the first item, setting a = rj = implies placing a Jeffreys' prior on A, 
7r(A) oc 1/A. Integration over A yields 7r(r) oc 1/r, which implies the Normal- Jeffreys' 
prior on 9. For the second item, notice that 7r(A) = 6{X — A'), where 5{.) denotes the 
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Figure 2.1: (a) Probability density functions for standard double Pareto (solid line), standard 
Cauchy (dashed line) and Laplace (dot-dash line) (A = 1) distributions, (b) Probability density 
functions for the generalized double Pareto with a) values of (1, 1) (solid line), (0.5, 1) 
(dashed hne), (1, 3) (long-dashed line), and (3, 1) (dot-dash line). 



Dirac delta function, since limQ,_!.oo ^^^a/rj^x' IE(A) = A' and linia^oo lini^/iy^A' — 
0. Thus, /o°°(A/2)exp(-A|e|)(^(dA) = (A72) exp (-A'|e|). □ 

As noted in Poison and Scott (2010), if 7r(r) has exponential or lighter tails, obser- 
vations are shrunk towards zero by some non-diminishing amount, regardless of size. 
This phenomenon is well-understood and commonly observed in estimation under the 
Laplace prior, where an exponential density mixes a normal density. The higher-level 
mixing (over A) in Proposition 1 allows 7r(r) to have heavier tails, remedying the un- 
wanted bias. 

As a grows, the density becomes Ughter tailed, more peaked and the variance be- 
comes smaller, while as 77 grows, the density becomes flatter and the variance increases. 
Hence if we increase a, we may cause unwanted bias for large signals, though causing 
stronger shrinkage for noise-like signals; if we increase 77 we may lose the ability to 



6 



ARTIN ARMAGAN, DAVID B. DUNSON AND JAEYONG LEE 



shrink noise-like signals, as the density is not as pronounced around zero; and finally, 
if we increase a and 77 at the same rate, the variance remains constant but the tails be- 
come lighter, converging to a Laplace density in the limit. This leads to over- shrinking 
of coefficients that are away from zero. As a typical default specification for the hyper- 
parameters, one can take a = t] = \. This choice leads to Cauchy-like tail behavior, 
which is well-known to have desirable Bayesian robustness properties. 

To motivate this default choice, we assess the behavior of the prior shrinkage factor 
K = 1/(1 + r) G (0, 1), where Q ~ N(0, r) is the parameter of interest (Carvalho et 
al. (2010)). As K — 0, the prior imposes no shrinkage, while as k — )■ 1 it has a strong 
puU towards zero. The generalized double Pareto distribution implies a prior ti{k) on k 
upon integration over A in Proposition 1. For the standard double Pareto, this is 



where Erfc(.) denotes the complementary error function. In Figure 2.2, we compare 
7r(K) under the standard double Pareto, Strawderman-Berger, horseshoe, and Cauchy 
priors, which may all be considered default choices. The priors behave similarly for k k, 
0, implying similar tail behavior. The behavior of vr(K) for k f« 1 governs the strength of 
shrinkage of small signals. As k — > 1, 7r(«;) tends towards zero for the Cauchy, implying 
weak shrinkage, while 7r(K) is unbounded for the horseshoe, suggesting a strong pull 
towards zero for small signals. The Strawderman-Berger and standard double Pareto 
priors are a compromise between these extremes, with 7r(/«) bounded for k ^ 1 in 
both cases. The standard double Pareto assigns higher density close to one than the 
Strawderman-Berger prior, and has the advantage of a simple analytic form over the 
Strawderman-Berger and horseshoe priors. 

Of course it is best to adjust a and 77 according to any available prior information 
pertaining to the sparsity structure of the estimated vector. For general a > and 77 > 
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Figure 2.2: Prior density of k implied by the standard double Pareto prior (solid line), 
Strawderman-Berger prior (dashed line), horseshoe prior (dot-dash Une) and standard Cauchy 
prior (dotted line). 
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Figure 2.3: Prior density of k (a) when a = 1 and rj = 0.5 (dashed), rj = 1 (sohd), r] = 2 
(dot-dash) (b) when 77 = 1 and a = 1 (solid), a = 2 (dashed), a = 3 (dot-dash). 



values, the prior on k is 



Tr{K\a, rj) 
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where iFi denotes the confluent hypergeometric function. Note that 7r{K\a,ri) takes 
a "horseshoe" shape when a = rj = 0. Carvalho, Poison, and Scott (2010) show that 
7r(«;) oc K^^{1 — k)^^ implies a Normal-Jeffreys' prior on 9, which can also be observed 
by setting a = rj = in ( |2.3| ) in conjunction with Proposition 1. Hence Tr{K\a,rj) is 
unbounded at k = 1 forcing TT{6\a, rj) to be unbounded at only if 77 = 0. The effects 



of a and t] are now observed with better clarity from Figure 2.3 As r] increases, less 
and less density is assigned to the neighborhood of k ?s 1, repressing shrinkage. On 
the other hand, increasing a values place more and more density in the neighborhood 
of K f« 1 promoting further shrinkage. This notion is later reinforced by Proposition 3, 
such that the prior induces a thresholding rule under maximum a posteriori estimation 



ifrj < 2\Ja + 1. Hence, we need to carefully pick these hyper-parameters, in particular 
a, as there is a trade-off between the magnitude of shrinkage and tail robustness. 

3. Bayesian Inference in Linear Models 

Consider the linear regression model y = X/3 + e, where y is an n-dimensional 
vector of responses, X is the n x p design matrix and e ~ N (O, cr^In)- Letting /3j |(t ~ 
GDP(,^ = (Jri/a, a) independently for j = 1, . . . ,p. 

From Proposition 1, this prior is equivalent to /3j| a ~ N(0, cx^rj), with Tj ~ Exp(Aj/2) 
and \j ~ Ga(a, -r]). We place the Jeffreys' prior on the error variance, vr(cj) oc l/o". 

Using the scale mixture of normals representation, we obtain a simple data aug- 
mentation Gibbs sampler having the conditional posteriors (/3|cj^, T, y) ~ N{(X'X + 
T-i)-iX'y,aMX'X + T-i)-^}, (a2|/3,T,y) ~ IG{(n + p)/2, (y - X/3)'(y - 
X/3)/2 + /3'T-i/3/2}, (A,-|/3„a2) ~ Ga(a + l,|/3,-|/a + r,), (rri|/3,, A„ a^) ~ 
Inv-Gauss{/z = (A^(T^//3|)^/^, p = A^}, where T = diag(Ti, . . . , r^) and Inv-Gauss 
denotes the inverse Gaussian distribution with location and scale parameters /i and p. In 
our experience, this Gibbs sampler is efficient with fast rates of convergence and mixing. 

In the absence of any prior information on a and rj, one may either set them to their 
default values or, as an alternative, choose hyper-priors to allow the data to inform about 
the values of a and rj. We use 7r(a) = 1/ (1+a)^ and 7r(?7) = 1/ to correspond to 
generalized Pareto hyper-priors with location parameter 0, scale parameter 1 and shape 
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parameter 1. The median value of the resulting distribution for a and r; is 1, centered at 
the default choices suggested earlier, while the mean and variance do not exist. 

For sampUng purposes, let a = 1/(1 +a) ande = 1/(1+7/). These transformations 
suggest a uniform prior on a and e in (0, 1) given the generaUzed Pareto priors on a and 
•q. Consequently, the conditional posteriors for a and e are 



We propose the embedded griddy Gibbs (Ritter and Tanner (1992)) sampUng scheme: 

i. Form a grid of m points a'^^\. . . , a^™) in the interval (0,1). 

ii. Calculate ui^'^) = 7r(aW|/3, ??). 

iii. Normalize the weights, = w'^'^^ / J2T=i w^'^K 

iv. Draw a sample from the set {a*^^) , . . . , a*^"*) } with probabihties {w^^^ , . . . , w^^^}, 
and set a = 1/a — 1 to be used at the current iteration of the Gibbs sampler. 

Repeat the same procedure for e and obtain a random draw for rj. We also experiment 
with fixing ry as 1 while treating a as unknown. In this case, the prior variance of /3|(T^ 
is determined by a. 

In what follows we establish the ties between the Bayesian approach we have taken 
and some frequentist regularization approaches. The simple analytic structure of the 
generalized double Pareto prior facilitates analyses while its hierarchical formulation 
leads to straight-forward computation. 

4. Sparse Maximum a Posteriori Estimation 

The generalized double Pareto distribution can be used not only as a prior in a 
Bayesian analysis, but also to induce a sparsity-favoring penalty in regularized least 
squares: 





(4.1) 



10 



ARTIN ARMAGAN, DAVID B. DUNSON AND JAEYONG LEE 



where X is initially assumed to have orthonormal columns and p{.) denotes the penalty 
function implied by the prior on the regression coefficients. Following Fan and Li 



(2001), let f3 = X'y, and denote the minimization problem in (4.1 1 for a component 
of j3 as 

= argmin | ^ - + aMW)^ , (4-2) 

with the penalty function p (I /3j I) = (a + l)log{ar] + |/3j|) that simply retains the term 
in — log7r(/3j|a, 77) that depends on f3j. 

From Fan and Li (2001), a good penalty function should result in an estimator that 
is (i) nearly unbiased when the true unknown parameter is large, (ii) a thresholding rule 
that automatically sets small estimated coefficients to zero to reduce model complexity, 
and (iii) continuous in data to avoid instability in model prediction. In the following. 



we show that the penalty function induced by prior ( |3.1| ) may achieve these properties. 
4.1. Near-unbiasedness 



The first order derivative of (4.2 1 with respect to /3j is sgn{/3j){\/3j\ -\-a'^p'{\/3j\)} — 
= sgn(/3,){|/3,| + a^a + + |/3,-|)} - p^, where = dp{\P^\) / d\P,\ 

is the term causing bias in estimation. Although it is appealing to introduce bias in 
small coefficients to reduce the mean squared error and model complexity, it is also 
desirable to limit the shrinkage of large coefficients with p'{\Pj\) — >• as |/3j| — 00. In 
addition, it is desirable for p'{\Pj\) to approach zero rapidly, implying shrinkage, and the 
associated introduction of bias rapidly decreases as coefficients get further away from 
zero. In fact, the rate of convergence of p'(|/3j|) to zero is of the same order under the 
generalized double Pareto and Normal- Jeffreys' priors, with lim|^^ |_>oo{(a + 1)/ {(Trj + 
= a + 1- As a controls the tail heaviness in the generalized double 
Pareto prior, with lighter tails for larger values of a, convergence of the ratio to (a + 1) 
is intuitive. In the case of LASSO, the bias, p'{\j3j\), remains constant regardless of |/3j|. 



which can also be observed in Figure 4.4 b) 



4.2. Sparsity 

As noted in Fan and Li (2001), a sufficient condition for the resulting estimator to 
be a thresholding rule is that the minimum of the function + a'^p'{\j3j\) is positive. 



Proposition 3. Under the formulation in Proposition 1, prior {3.11 implies a penalty 
yielding an estimator that is a thresholding rule ifri< 2\/a + 1. 
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This result is obtained by finding the minimum of \(3j\ + cr^p' and taking it 
greater than zero. The thresholding is a direct consequence of the fact that when < 

min^^, { I /3j I + 0-2 (a + 1 ) / (fj?? + 1 /3j I ) } , which requires that min^^ { | /3j | + a'^p' ( | /3j | ) } > 0, 



the derivative of (4.2i is positive for all positive /3j and negative for all negative Pj. In 
this case, the penalized least squares estimator is zero. When > min/3^{|/3j| + 
o"^ (a + 1 ) / (fir/ + I /3j I ) } , two roots may exist. The larger one (in absolute value) or zero 
is the penalized least squares estimator. To elaborate more on this, the root(s) may exist 
for sgn(/3,){|/3,| + ^7V(|/3,|)} - /3, = only when |/3,| > min^J|/3,| + aVd/^il)}- 
A helpful illustration is Figure 3 of Fan and Li (2001). 

4.3. Continuity 

Continuity in data is important if an estimator is to avoid instabilities in prediction. 
As in Breiman (1996), "a regularization procedure is unstable if a small change in data 
can make large changes in the regularized estimator". Discontinuities in the thresholding 
rule may result in inclusion or dismissal of a signal with minor changes in the data used 
(see Figure |4.4| ^b)). Hard-thresholding, the "usual" variable selection, is an unstable 
procedure, while ridge and LASSO estimates are considered stable. 

A necessary and sufficient condition for continuity is that the minimum of the func- 
tion + a'^p' {\fij\) is at zero (Fan and Li (2001)). For our prior, the minimum of 
this function is obtained at = a{yja + 1 — ?]). Therefore r/ = \Ja + 1 yields an 
estimator with this property. 



Proposition 4. Under the formulation in Proposition 1, a subfamily of prior {2.1 ) with 
rj = \/a + 1 yields an estimator with the continuity property. 

In this particular case, the penahzed likelihood estimator is set to zero if < 
a^/a + 1. When > a\/a + 1, 



/3j-(Tv^^+{^2_^2/3j(tv^S+T-3ct2(q+1)}1/2 
^ ^ /3j+ctV^-{/32-2/3j(Tv^^-3(t2(q+1)}1/2 . (4-3) 



> 0, 
/3, < 0. 



As can be observed in Figure 4.4 a), ensuring continuity by letting tj = ^/a + 1 creates 
a trade-off between sparsity and tail-robustness. As the thresholding region becomes 
wider, the larger values are penalized further, yet not nearly at the level of LASSO. 
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(a) 



(b) 



Figure 4.4: Thresholding functions for (a) generalized double Pareto prior with r] = \/a+ 1, 
a = {1, 3, 7}, (b) Hard thresholding, generalized double Pareto prior with t] = 2, a = 3 and 
LASSO with a = 1. 



4.4. Maximum a Posteriori Estimation via Expectation-Maximization 

We assume a normal likelihood to formulate the procedure for non-orthogonal lin- 
ear regression. Estimation is carried out via the expectation-maximization (EM) algo- 
rithm. 

4.4.1. Exploiting the Normal Mixture Representation 

We take the expectation of the log-posterior with respect to the conditional posterior 



distributions of (r~^|/3] \ Aj, (T^Cfe)) and (Aj|/3] V^C^)) at the A;th step, then maximize 
with respect to Pj and cr^ to get the values for the {k + l)th step. 



E-step: 



{a + 



|#|(|#|+aWr?), 



4"^ 
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M-step: Letting D^'^) = Amg{d!{ \ . . . , dp*"^), we have 
^(fc+i) ^ (X'X + D('=))-iX'y, 



a 



n+p + 2 



We refer to this estimator as GDP(MAP). 

4.4.2. Exploiting the Laplace Mixture Representation and the One-step Estimator 

In the proof of Proposition 1, the integration over r leads to a Laplace mixture 
representation of the prior. Since the mixing distribution of the Laplace is a known 
distribution the required expectation is obtained with ease, resulting in the maximization 
step, 



^(fc+i) 

argmax { --^ttv (y - ^P)' (j - ^f^) 



a 



0-2(^+1) 



2ac - Vb^ - "^ad? 



(4.4) 



2a2 



where a = -{n + p + 2),h = {a + 1)^. \^f^^\/{\pf^\/'j'^^'^ + r?), and c = (y - 
X/3{fc+i))'(y _ X/3('=+^)). The component-specific multiplier on is obtained from 
the expectation of Xj with respect to its conditional posterior distribution, 7r(Aj|/3j, cr^). 



Similar results to (|4^ are in Candes, Wakin and Boyd (2008), Cevher (2009), and 
Garrigues (2009). 

An intuitive relationship to the adaptive LASSO of Zou (2006) and the one-step 
sparse estimator of Zou and Li (2008) can be seen via the Laplace mixture representa- 
tion. As a computationally fast alternative to estimating the exact mode via the above 
EM algorithm, we can obtain a "one-step estimator" and exploit the LARS algorithm as 
in Zou and Li (2008). The one-step estimator is 

/3*'^^ = arg min < 



/3 



y - X/3)' (y - X/3) + at ^ Jf^'' ) , (4.5) 

j=i I + ff 
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with = 2a'^^^\a + 1) and t/^ = a^^^^rj. This estimator resembles the adaptive LASSO. 
The LARS algorithm can be used to obtain /3^^^ very quickly. We refer to this estimator 
as GDP(OS). 

Remark 1. For rj^ = 0, the GDP(OS) solution path for varying is identical to the 
adaptive LASSO solution path with 7 = 1 (see (4) in Zou (2006)) using identical f3^^\ 
Remark 2. GDP(OS) forms a bridge between the LASSO and the adaptive LASSO: as tj'^ — ;> 
00 and ayr]^^X^<oo, GDP(OS) gives the LASSO solution with penalty parameter A^. 

We derive the GDP(OS) estimator only to reveal a close connection with the adaptive 
LASSO of Zou (2006) and do not use it in our experiments. 

4.4.3. Normal vs. Laplace Representations in Computation 

As pointed out by an anonymous referee, it is appropriate to compare the conver- 
gence behavior of the EM algorithms that exploit different mixture representations. We 
generated n = {200, 400, 600, 800, 1000} observations from yt = x'-fB* + a, where the 
Xij were independent standard normals for p = {20,40,60,80,100}, ~ N(0,cr^), 
and a = 3. We set the first p/A components of /3* to be 1 and the rest to 0. For each 
(n, p) combination we simulated 100 data sets and ran the EM algorithms obtained from 



normal and Laplace scale mixture representations. Figure 4.5 illustrates the number of 
iterations taken by the two algorithms until 11/3^'^+^) — (3^''^ II2 < 10^^. As expected, the 
convergence under the Laplace mixture representation was much faster with the inter- 
mediary mixing parameter tj integrated out rather than using the expectation step in the 
EM algorithm. 

4.5. Oracle Properties 

Following Zou (2006) and Zou and Li (2008), we show that the GDP(MAP) and 
GDP(OS) estimators possess oracle properties. Relaxing the normality assumption on the 
error term leads to two conditions for Theorem 2 and Theorem 3. 

(Al) ?/j = Xj/3* + ti where ei, . . . , e„ are independent and identically distributed with 
mean and variance a^. 

(A2) ^X'X — )• C, where C is a positive definite matrix. 



In what follows, A = {j : (3* 7^ 0, j = 1, . . . ,p}, retains the entries of (3 indexed 
by A, and C_4 retains the rows and columns of C indexed by A. 
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Figure 4.5: Number of iterations until convergence of the EM algorithms under normal and 
Laplace representations. 

Theorem 1. Let 



Pt^ = arg mm <j (y - X/3)' (y - X/3) + < log (|/3, | + ry^) 



denote the GDP(MAP) estimator, where a'^ = 2(T^(a„ + 1) and rj^ = arin- Let An = {j '■ 

nj 

Then fSl^^ is 



f^ni'^ 7^ 0, j = 1, . . . ,p}. Suppose that a'„ —5- oo, a'^/ ^/rl — )• and, vj^\fn — )• c < oo. 



1. consistent in variable selection in that Xya^n^ca IF'(-^n — A) = 1; 

2. asymptotically normal with y/n{l3^^ — A-N{0, a'^C^). 

Remark 3. More generally, the above results hold if a'^/ {y/ni]'^) — ^ oo and q^/ ^/n 
0. 



Theorem 2. Let (3n^ denote the GDP(OS) estimator in (4.5) and An = {j ■ f^n] / 



0, J = 1, . . . ,p}. Suppose that alt — ^ oo, alil ^fn — )• 0, and r]^n\fn — )• c < oo. Then 
f^'h^ is 

L consistent in variable selection in that lim„_s.oo IP'(-4.„ = A) = 1; 
2. asymptotically normal with \/^(/3n^ ~ (3*a) -^-^(0, a'^C~^). 
The proofs are deferred to Section 8. 
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5. Experiments 
5.1. Simulation 

In this section, we compare the proposed estimators to the posterior means obtained 
under the normal, Laplace, and horseshoe priors, to the Bayesian model averaged (BMA) 
estimator, as well as to the sparse estimates resulting from LASSO (Tibshirani (1996)) and 
SCAD (Fan and Li (2001)). GDP(PM) and GDP(MAP) denote the posterior mean and the 
MAP estimates, respectively, under the generalized double Pareto prior. Hyper-parameter 



values are provided in footnotes of Tables 5.1 and 5.2 when fixed in advance and are 
otherwise treated as random with the priors specified in Section 3. When not fixed, 
we first obtain the posterior means of the hyper-parameters from an initial Bayesian 
analysis, then use them in the calculation of the MAP estimates. 

We generated n = {50, 400} observations from in = + £», where the xij were 
standard normals with Covfj;,, x,/ ) = 0.5l-'-^'l, ei ~ N(0, 0-2), and cj = 3. We used the 
following (3* configurations: 

Model 1 : 5 randomly chosen components of /3* set to 1 and the rest to 0. 
Model 2: 5 randomly chosen components of /3* set to 3 and the rest to 0. 
Model 3: 10 randomly chosen components of (3* set to 1 and the rest to 0. 
Model 4: 10 randomly chosen components of (3* set to 3 and the rest to 0. 
Models-. (3* = (0.85, ...,0.85)'. 

In our experiments y and the columns of X were centered and the columns of X 
scaled to have unit length. For the calculation of competing estimators we used lars 
(Hastie and Efron (2011)), SIS (Fan et al. (2010)), monomvn (Gramacy (2010)) and 
BAS (Clyde and Littman (2005), Clyde, Ghosh, and Littman (2010)) packages in R. 
We mainly followed the default settings provided by the packages. Under the normal 
prior, the so-called "ridge" parameter was given an inverse gamma prior with shape and 
scale parameters 10^'^. Under the Laplace prior, as a default choice, a gamma prior 
was placed on the "LASSO parameter" A^, as given in (6) of Park and Casella (2008), 
with shape and rate parameters 2 and 0.1, respectively. Under the horseshoe prior, the 
monomvn package uses the hierarchy given in Section 1.1 of Carvalho, Poison, and 
Scott (2010). For BMA, we used the default settings of the BAS package that employs 
the Zellner-Siow prior given in Section 3.1 of Liang et al. (2008). The tuning for LASSO 
and SCAD were carried out by the criteria given in Yuan and Lin (2005) and Wang, Li, 
and Tsai (2007), respectively, avoiding cross-validation. 
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100 data sets were generated for each case. In Table |5.1[ we report the median 
model error. Model error was calculated as {(3* — (3)'C{(3* — (3), where C is the 
variance-covariance matrix that generated X and denotes the estimator in use. The 
values in the subscripts give the bootstrap standard error of the median model error val- 
ues obtained. The bootstrap standard error was calculated by generating 500 bootstrap 
samples from 100 model error values, finding the median model error for each case, and 
then calculating the standard error for it. Under each model, the best three performances 
are boldfaced in the tables. 

GDP(PM) estimates showed a similar performance to that of horseshoe under sparse 
setups. GDP(PM) (with a and -q unknown) also showed great flexibility in adapting to 
dense models with small signals. GDP(MAP) estimates performed similarly to SCAD and 
much better than LASSO, particularly so with increasing sparsity, signal and/or sample 
size. The GDP(PM) and GDP(MAP) calculations are straightforward and computationally 
inexpensive due to the normal (and Laplace) scale mixture representation used. Being 
able to use a simple Gibbs sampler (especially when a = r] = 1) makes the procedure 
attractive for the average user. 

Letting a = r] = 1 may be somewhat restrictive if the underlying model is very 
dense or very sparse, but in the cases we considered, it performed comparably to others 
and we believe that it constitutes a good default prior similar to standard Cauchy with 
the added advantage of thresholding ability. Although we do not take up p » n cases in 
this paper, in such situations much larger values of a would need to be chosen to adjust 
for multiplicity. 

5.2. Inferences on Hyper-parameters 

Here we take a closer look at the inferences on the hyper-parameters obtained from 
an individual data set for Models 2 and 5 from Section 5.1. This gives us some insight 
into how a and rj are inferred with changing sample size and sparsity structure. Note 
that GDP(PM)^ is more restrictive than GDP(PM) as 77 is fixed, treating only a as unknown. 



Figure 5.6 gives the marginal posteriors of a and -q in cases of GDP(PM) and GDP(PM) 



as described in Section 5.1, while Table 5.2 reports the posterior means for a and r/, as 



well as model error (ME) performance (as calculated in Section 5.1) on the particular 
data set used. We clearly observe the adaptive nature and higher flexibility of GDP(PM) 
moving from a sparse to a dense model with a big increase, particularly in r], flattening 
the prior on (3. There is not quite as much wiggle room in the case of GDP(PM)^. All it 
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Table 5.1: Model error comparisons. 



n = 50 



Method 


Model 1 


Model 2 


Model 3 


Model 4 


Model 5 


Normal 


2.299o 


085 


4.879o 


263 


2.585o.i34 


4.972o 


385 


2.8860. 


150 


Laplace 


2.634o 


137 


3.662o 


233 


2.837o.i26 


4.326o 


211 


3.458o. 


120 


Horseshoe 


2.264o. 


086 


2.3I60 


167 


3.205o.i4o 


3.929o 


218 


4.409o 


130 


BMA 


2.451o 


123 


1.647o. 


126 


4.043o.233 


3.062o. 


194 


6.015o 


301 


GDP(PM)^ 


2.306o 


114 


2.405o 


192 


3.1930.215 


4.123o 


304 


4.283o 


142 


GDP(PM)^ 


2.303o 


095 


2.309o 


195 


3.1240.153 


3.9IO0 


237 


4.45I0 


109 


GDP(PM) 


2.271o. 


085 


2.6O60 


167 


3.047o.i47 


4.348o 


171 


3.640o. 


134 


GDP(MAP)^ 


3.414o 


148 


1.619o. 


150 


5.605o.298 


2.97O0. 


168 


8.769o 


403 


GDP(MAP)2 


4.250o 


354 


I.6I80. 


153 


6.331o.3oa 


3.O4O0. 


163 


9.3O80 


377 


GDP(MAP) 


4.876o 


355 


2.O9I0 


182 


4.299o.222 


3.74O0 


284 


5.724o 


177 


LASSO 


2.183o. 


124 


2.6I80 


152 


3.2580.194 


3.53I0 


172 


5.646o 


229 


SCAD 


3.732o 


214 


2.132o 


229 


5.2490.239 


3.179o 


193 


8.505o 


387 


n = 400 


Normal 


0.395o 


014 


0.455o 


019 


0.4260.016 


0.455o 


024 


0.412o. 


013 


Laplace 


0.315o 


016 


0.374o 


014 


O.3880.016 


0.422o 


015 


0.457o. 


014 


Horseshoe 


0.219o 


016 


0.205o 


010 


O.34I0.014 


0.346o 


009 


0.514o 


023 


BMA 


0.151o. 


Oil 


0.125o 


005 


O.24O0.016 


O.2II0 


009 


0.646o 


037 


GDP(PM)^ 


0.233o 


016 


O.2O60 


009 


0.326o.oi5 


0.284o 


014 


0.625o 


031 


GDP(PM)^ 


0.228o 


017 


0.215o 


009 


0.332o.oi3 


0.303o 


010 


0.579o 


027 


GDP(PM) 


0.248,, 


017 


0.182o 


007 


0.377o.oi6 


0.362o 


012 


0.466o. 


016 


GDP(MAP)^ 


0.154o. 


014 


O.lllo. 


Oil 


0.2860. 016 


O.2IO0. 


Oil 


0.739o 


043 


GDP(MAP)^ 


O.I6I0 


013 


O.lllo. 


010 


0.284o.oi6 


O.2IO0. 


009 


0.652o 


035 


GDP(MAP) 


0.185o 


017 


0.119o 


010 


0.3260. 016 


0.336o 


010 


0.478o 


020 


LASSO 


O.25I0 


014 


0.276o 


014 


0.339o.o2o 


0.348o 


oil 


0.485o 


021 


SCAD 


O.12I0. 


010 


O.II80. 


008 


0.233o.oii 


O.2O60. 


017 


0.469o 


019 



^ a = 1, rj = 1; ^ r] = 1 



can do is to drive a smaller to allow heavier tails to accommodate a dense structure. As 
observed in Table 5.1 however, GDP(PM)^ performs comparably in sparse cases. 



Table 5.2: Posterior means of the hyper-parameters and the resulting model error. 







n = 


50 


n = 


400 






GDP(PM) 


GDP(PM)^ 


GDP(PM) 


GDPCPM)"" 


Model 2 


a 


2.464 


1.165 


0.688 


0.870 




V 


4.181 




0.614 






ME 


2.443 


2.219 


0.149 


0.181 


Model 5 


a 


5.262 


1.200 


9.400 


0.560 




V 


9.476 




51.735 






ME 


6.290 


7.019 


0.518 


0.614 



^r; = 1 
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6. Data Example 

We consider the ozone data analyzed by Breiman and Friedman (1985) and by 
Casella and Moreno (2006). The original data set contains 13 variables and 366 obser- 
vations. The modeled response is the daily maximum one-hour averaged ozone reading 
in Los Angeles over 330 days in 1976. There are p = 12 predictors considered and 
deleting incomplete observations leaves n = 203 observations. For validation, the data 
were split into a training set containing 180 observations and a test set containing 23 
observations. We considered models including main effects, quadratic, and two-way 
interaction terms resulting in 2^*^ possible subsets. The complex correlation structure of 



the data is illustrated in Figure 6.7 



Figure 6.8 summarizes the performance of the proposed estimators and their com- 
petitors. Median values for Rtg^t and the ±2 standard error intervals were obtained by 
running the methods on 100 different random training-test splits. Standard errors were 
computed via bootstrapping the medians 500 times. 

The median number of predictors retained in the model by all three GDP(MAP) esti- 
mates was only 4 while it was 14 and 9 for LASSO and SCAD. Hence GDP(MAP) promoted 
much sparser models. In terms of prediction, GDP(PM)^ yielded the second best results 
after BMA, with GDP(PM)^, GDP(PM), and the horseshoe estimator all having somewhat 
worse performance. These shrinkage priors are designed to mimic model averaging be- 
havior, so we expected to obtain results that were competitive with, but not better than, 
BMA. The improved performance for GDP(PM)^ may be attributed to the use of default 
hyper-parameter values that were fixed in advance at values thought to produce good 
performance in sparse settings. Treating the hyper-parameters as unknown is appealing 
from the standpoint of flexibility, but in practice the data may not inform sufficiently 
about their values to outperform a good default choice. GDP(MAP)^ and SCAD both per- 
formed within the standard error range of LASSO, while retaining a smaller number of 
variables in the model. As it is important to account for model uncertainty in prediction, 
the posterior mean estimator under the GDP prior is appealing in mimicking BMA. In 
addition, obtaining a simple model containing a relatively small number of predictors 
is often important, since such models are more likely to be used in fields in which pre- 
dictive black boxes are not acceptable and practitioners desire interpretable predictive 
models. 
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7. Discussion 

We have proposed a hierarchical prior obtained through a particular scale mixture 
of normals where the resulting marginal prior has a folded generahzed Pareto density 
thresholded at zero. This prior combines the best of both worlds in that fully Bayes in- 
ferences are feasible through its hierarchical representation, providing a measure of un- 
certainty in estimation, while the resulting marginal prior on the regression coefficients 
induces a penalty function that allows for the analysis of frequentist properties under 
maximum a posteriori estimation. The resulting posterior mean estimator can be ar- 
gued to be mimicking a Bayesian model averaging behavior through mixing over higher 
level hyper-parameters. Although Bayesian model averaging is appealing, it can be ar- 
gued that allowing parameters to be arbitrarily close to zero instead of exactly equal to 
zero may be more natural in some problems. Hence we have a procedure that not only 
bridges two paradigms - Bayesian shrinkage estimation and regularization - but also 
yields three useful tools: a sparse estimator with good frequentist properties through 
maximum a posteriori estimation, a posterior mean estimator that mimics a model aver- 
aging behavior, and a useful measure of uncertainty around the observed estimates. In 
addition, the proposed methods have substantial computational advantages in relying on 
simple block-updated Gibbs sampling, while BMA requires sampUng from a model space 
with T' models. Given the simple and fast computation and the excellent performance 
in small sample simulation studies, the generahzed double Pareto should be useful as a 
shrinkage prior in a broad variety of Bayesian hierarchical models, while also suggest- 
ing close relationships with frequentist penalized likelihood approaches. The proposed 
prior can be used in generalized linear models, shrinkage of basis coefficients in non- 
parametric regression, and in such settings as factor analysis and nonparametric Bayes 
modeling. 

8. Technical Details 



Proof of Theorem 1. The proof follows along similar lines as does the proof of Theorem 
2 in Zou (2006). We first prove asymptotic normahty. Let (3 = j3* + n/ ^Jn and 
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Let u„ = arg min Vn{u), suggesting u„ = y/n{Pl^ ' - (3*). Now 



K(u) - K(0) = u' ( -X'x") u - 2^u + a; V log 

j=i 



and we know that X'X/n C and e'X/^ 4 W = N(0, cj^C). Consider the limiting 
behavior of the third term, noting that lima_5.oo(l + b/a)°- = e^. If /?* / 0, then 



"n log IfS'l+n'^ < «n log 



1/3* l+r?; 



0. If /?* 



0, then a'n log 



a'n log (^1 + ^ which is if uj = 0, and diverges 



otherwise. By Slutsky's Theorem 

d { u^C^u^ - Iu'j^Wa if Uj = 0\/jiA 



VJvi) 



K(0) 



00 



otherwise. 



l^(u)— y„(0) is convex and the unique minimum of the righthand side is (C^^ W_4, 0)'. 
By epiconvergence (Geyer (1994), Knight and Fu (2000)), 



0. 



(8.1) 



Since W_4 = N(0, a'^Cj^), this proves asymptotic normality. 

Now Vj G A, pI^^ A-13*; thus P(j G An) 1. Hence for consistency, it is 
sufficient to show that V/ ^ A, P(/ G 0. Consider the event j' £ An- By the 

KKT optimality conditions, 2x' , (y — X/3^°°^ 



by (8.1 



(^),. Noting that V^/3i;;?Ho 



00, while 
2x',(y-X/3^r)) 



2 < 



n 



By (8.1 1 and Slutsky's Theorem, we know that both terms in the brackets converge in 
distribution to some normal, so 



(j'G A) <p|2x;., (y-X/3(r) 
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This concludes the proof. 



□ 



Proof of Theorem!. We modify the proof of Theorem 2 in Zou (2006). Here /3 
denotes the least squares estimator. We first prove asymptotic normahty. Let /3 
/3* + u/v^ and 



Vn{n) 



(0) 



Let Un = argmin V„(u), suggesting u„ = ^A^(/3n ^ - fiV)- Now 
Vn{n)-Vn{Q) = u'f-X'xV-2^u 



+ 



Or 



and we know that X'X/n ^ Cande'X/^4 W = N(0,(72C). Consider the limiting 
behavior of the third term. If ^ then, by the Continuous Mapping Theorem, 

{\P^ni\ + ni}-'^Wni\ + rir' and V^d/?* - + u^/^\ - ^ n,sgn(/3*.). 



By Slutsky's Theorem, {al/ ^){\p'^>\ + vi]'^ V^{\P*n, + ^jlM " I/?:,!) ^0. If 



0, then V^(|/3* . + - 1/3* . I) = |n,| and aUl/?!?! + ^U'W^ 



«n/(V^I/5S^I + V^^n), Where 



Op(l). Again by Slutsky's Theorem, 



F„(u)-F„(0)- 



u^C^u^ - 2u^W^ if = for all j 
oo otherwise. 



Vn{u.)—Vn(0) is convex and the unique minimum of the righthand side is (C^^W^, 0)'. 
By epiconvergence (Geyer (1994), Knight and Fu (2000)), 



(8.2) 



Since = N(0, cr^C^), this proves the asymptotic normahty. 

Now Vi G A, l^ll- ^Plf thus P(j G An) 1. We show that for all j' ^ A, 
i' G An) —5- 0. Consider the event j' G An- By the KKT optimality conditions. 



2x'.,(y - X/3i')) = aUlAi^l + '7")-^ We know that al(|/3^°]| + ri)-^/^^oo. 
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while 



2x'.,(y-X^ _ 



n 



x;.,XVn(/3;-/3i'^) ^ x;.,e 



n \ n 



By (8.2 1 and Slutsky's Theorem, we know that both terms in the brackets converge in 



distribution to some normal, so 

P(/eA)<p{2x;,(y-X3W)=^^ 

which proves consistency. □ 
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Figure 5.6: Inferences for (a) GDP(PM) for n = 50 under Model 2, (b) GDP(PM)2 for n = 50 
under Model 2, (c) GDP(PM) for n = 400 under Model 2, (b) GDP(PM)2 for n = 400 under 
Model 2, (e) GDP(PM) for n = 50 under Model 5, (f) GDP(PM)2 for n = 50 under Model 5, (g) 
GDP(PM) for n = 400 under Model 5, (h) GDP(PM)2 for n = 400 under Model 2 . 
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Figure 6.7: The correlation stiTicture of the Ozone data. 
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Figure 6.8: Out-of-sample performance comparisons for Ozone data, (x) denotes the median 
value for R^e^t while the hues represent the ±2 standard error regions, = 1, 77 = 1; ^77 = 1. 



